FOREWORD

This research was performed under exploratory development work unit
RF63-522-801-013-03-04, Testing Strategies for Operational Computer-based Training, under
the sponsorship of the Office of Naval Technology, and advanced development project
Z1772-ET008, Computer-Based Performance Testing, under the sponsorship of Deputy
Chief of Naval Operations (Manpower, Personnel, and Training). The general goal of
this development is to create and evaluate computer-based simulations of operationally
oriented tasks to determine if they result in better assessment of student performance
than more customary measurement methods.

The results of this study are primarily intended for the Department of Defense 
training and testing research and development community. 



B. E. BACON 
Captain, U.S. Navy 
Commanding Officer



J. S. MCMICHAEL 
Technical Director 



SUMMARY 



Background 

The literature regarding computer-based assessment is contradictory and
inconclusive: Many benefits may be obtained from computerized testing. Some of these may
be related to attitudes and assumptions associated with the use of novel media or
innovative technology per se. However, and just as readily, potential problems may result
from the employment of computer-based measurement. Differences between this mode
of assessment and traditional testing techniques may, or may not, impact upon the
reliability and validity of measurement. Notably absent from this literature are studies
that have compared these testing characteristics of computer-based assessment with
customary measurement methods for estimating recognition performance.



Problem 

Many student assessment schemes which are currently used in Navy training are
suspected of being insufficiently accurate or consistent. If true, this could result in
either overtraining, which increases costs needlessly, or undertraining, which
culminates in unqualified graduates being sent to the fleets.

Objective 

The specific objective of this research was to compare the reliability and validity 
of a computer-based and a paper-based procedure for assessing recognition perfor- 
mance. 



Method 

A computer-based and paper-based test were developed to assess recognition of 
Soviet and non-Soviet aircraft silhouettes. These tests were administered to 83 student 
pilots and radar intercept officers from the F-14 Fleet Replacement Squadron, VF-124,
NAS Miramar. All volunteered to participate in this study. After the subjects received 
the paper-based test, they were immediately given the computer-based test. It was 
assumed that a subject's state of recognition knowledge was the same during the 
administration of both tests. 

Reliabilities for both modes of testing were estimated by deriving internal
consistency indices using an odd-even item split. These estimates were adjusted by
employing the Spearman-Brown Prophecy Formula. Reliability estimates were calcu- 
lated for test score, average degree of confidence, and average response latency for the 
computer-based test; reliability estimates were calculated for test score and average 
degree of confidence only for the paper-based test. None was computed for average 
response latency since this was not measured for the paper-based test. Equivalences 
between these two modes of assessment were estimated by Pearson-product-moment 
correlations for total test score and average degree of confidence. 



In order to derive discriminative validity estimates, subjects were placed into two
groups according to whether their performance through the squadron's curriculum
was above or below the mean average grade for this sample. A stepwise multiple
discriminant analysis, using Wilks' criterion for including and rejecting variables, and
their associated statistics were computed to ascertain how well computer-based and 
paper-based measures distinguished among the defined groups expected to differ in 
their recognition of aircraft silhouettes. Predictive validity indices were obtained by 
computing a canonical analysis between computer-based and paper-based recognition 
measures and subjects' test scores for each phase of the curriculum. 



Results 

It was demonstrated that (a) computer-based and paper-based measures of
recognition test score were not significantly different in reliability or internal consistency,
(b) the paper-based measure of average degree of confidence in recognition judgments
was more reliable or internally consistent than the computer-based measure, (c)
computer-based and paper-based measures of average degree of confidence were more
equivalent than these measures of recognition test score, (d) according to two sets of
criteria, the discriminant coefficients and F-ratios and corresponding means, the
discriminative validities of computer-based and paper-based measures were about the
same for distinguishing groups above or below mean average curriculum grade, (e)
according to another set of criteria, the pooled within-groups correlations between the
discriminant function and computer-based and paper-based measures, the former had
discriminative validity superior to the latter, and (f) statistics associated with the
canonical correlation suggested the predictive validity of computer-based measures
approximates that of paper-based measures.



Discussion 

This study established that the relative reliability of computer-based and
paper-based measures depends upon the specific criterion assessed. That is, regarding the
recognition test score itself, it was found that computer-based and paper-based
measures were not significantly different in reliability or internal consistency. However,
regarding the average degree of confidence in recognition judgments, it was found that
the paper-based measure was more reliable or internally consistent than its
computer-based counterpart. The extent of the equivalence between these two modes of
measurement was contingent upon particular performance criteria. It was demonstrated that
the equivalence of computer-based and paper-based measures of average degree of
confidence was greater than that for recognition test score. The relative discriminative
validity of computer-based and paper-based measures was dependent upon the specific
statistical criteria selected. The discriminant coefficients, F-ratios, and corresponding
means indicated that the validities of computer-based and paper-based measures were
about the same for distinguishing groups above or below mean average curriculum
grade. However, according to another set of criteria, the pooled within-groups
correlations between the discriminant function and computer-based and paper-based measures,
the former had validity superior to the latter. Also, according to the statistics
associated with the canonical correlation, this study demonstrated that the predictive
validity of computer-based measures approximates that of paper-based measures. The
results of this research supported the findings of some studies, but not others. As was
discussed, the reported literature on this subject is contradictory and inconclusive.

RECOG, the computer-based system for assessing recognition performance,
together with the Soviet and non-Soviet aircraft silhouette database, is referred to as
FLASH IVAN. This system is currently being used to augment the teaching and
testing of this subject matter in VF-124. RECOG was designed and developed with
generalizability (i.e., independence of subject-matter domain) and transferability (i.e.,
capable of readily running on different computer systems) in mind, as was the
Computer-Based Educational Software System (CBESS) (Brandt, 1987; Brandt, Gay, Othmer &
Halff, 1987). CBESS consists of a number of component quizzes such as JEOPARDY,
TWENTY QUESTIONS, and CONSTRAINT. Since the time that RECOG
and FLASH IVAN were developed and evaluated, two other tests, FLASH and PICTURE,
were added to CBESS. In terms of their function, these additional quizzes are
similar to RECOG and FLASH IVAN. However, these more recently produced tests
are written in the "C" programming language for the Navy standard microcomputer,
the Zenith Z-248. Consequently, since TERAK computers are no longer being produced
(the company went out of business), FLASH and PICTURE can be perceived as
replacements for RECOG and FLASH IVAN.



Recommendations 

Based upon the findings of this study, the following actions are recommended:

(a) Commander, Naval Air Force, U.S. Pacific Fleet, use FLASH and PICTURE
to supplement the training and testing of fighter and other crew members to recognize
Soviet and non-Soviet silhouettes.

(b) Chief of Naval Operations, Total Force Training and Education, fund the 
evaluation and seek implementation of FLASH and PICTURE in other content areas or 
subject-matter domains (e.g., ship silhouettes, electronic schemata, human anatomy) to 
ascertain the universality of the validity and reliability results established in this 
reported research. 
INTRODUCTION



Background 

The consequences of computer-based assessment on examinees' performance are
not obvious. The investigations that have been conducted on this topic have produced
mixed results. Some studies (Serwer & Stolurow, 1970; Johnson & Mihal, 1973)
demonstrated that testees do better on verbal items given by computer than on
paper-based items; however, just the opposite was found by other studies (Johnson & Mihal, 1973;
Wildgrube, 1982). One investigation (Sachar & Fletcher, 1978) yielded no significant
differences resulting from computer-based and paper-based modes of administration on
verbal items. Two studies (English, Reckase & Patience, 1977; Hoffman & Lundberg,
1976) demonstrated that these two testing modes did not affect performance on
memory retrieval items. Sometimes (Johnson & Mihal, 1973) testees do better on
quantitative tests when computer given, sometimes (Lee, Moreno, & Sympson, 1984)
they do worse, and other times (Wildgrube, 1982) it may make no difference. Other
studies have supported the equivalence of computer-based and paper-and-pencil
administration (Elwood & Griffin, 1972; Hedl, O'Neil, & Hansen, 1973; Kantor, 1988;
Lukin, Dowd, Plake, & Kraft, 1985). Some researchers (Evan & Miller, 1969; Koson,
Kitchen, Kochen, & Stodolosky, 1970; Lucas, Mullin, Luna, & McInroy, 1977; Lukin,
Dowd, Plake, & Kraft, 1985; Skinner & Allen, 1983) have reported comparable or
superior psychometric capabilities of computer-based assessment relative to
paper-based assessment in clinical settings.

Investigations of computer-based administration of personality items have yielded 
reliability and validity indices comparable to typical paper-based administration (Katz 
& Dalby, 1981; Lushene, O'Neil, & Dunn, 1974). No significant differences were
found in the scores of measures of anxiety, depression, and psychological reactance 
due to computer-based and paper-based administration (Lukin, Dowd, Plake, & Kraft, 
1985). Studies of cognitive tests have provided inconsistent findings with some (Hitti, 
Riffer, & Stuckles, 1971; Rock & Nolen, 1982) demonstrating that the computerized 
version is a viable alternative to the paper-based version. Other research (Hansen & 
O'Neil, 1970; Hedl, O'Neil, & Hansen, 1973; Johnson & White, 1980; Johnson & 
Johnson, 1981), though, indicated that interacting with a computer-based system to
take an intelligence test could elicit a considerable amount of anxiety which could 
affect performance. 

Regarding computerized adaptive testing (CAT), some empirical comparisons 
(McBride, 1980; Sympson, Weiss, & Ree, 1982) yielded essentially no change in
validity due to mode of administration. However, test-item difficulty may not be indifferent
to manner of presentation for CAT (Green, Bock, Humphreys, Linn, & Reckase,
1984). When going from paper-based to computer-based administration, this mode
effect is thought to have three aspects: (a) an overall mean shift where all items may
be easier or harder, (b) an item-mode interaction where a few items may be altered
and others not, and (c) the nature of the task itself may be changed by computer
administration. These inconsistent results of mode, manner, or medium of testing may
be due to differences in methodology, test content, population tested, or the design of
the study (Lee, Moreno & Sympson, 1984). 
With computer costs coming down and people's knowledge of these systems
going up, it becomes more likely economically and technologically that many benefits
can be gained from their use. A direct advantage of computer-based testing is that
individuals can respond to items at their own pace, thus producing ideal power tests.
Some indirect advantages of computer-based assessment are increased test security,
less ambiguity about students' responses, minimal or no paperwork, immediate scoring,
and automatic records keeping for item analysis (Green, 1983a, 1983b). Some of the
strongest support for computer-based assessment is based upon the awareness of faster
and more economical measurement (Elwood & Griffin, 1972; Johnson & White, 1980;
Space, 1981). Cory (1977) reported some advantages of computerized over
paper-based testing for predicting on-the-job performance.

Ward (1984) stated that computers can be employed to augment what is possible
with paper-based measurement (e.g., to obtain more precise information regarding a
student than is likely with more customary measurement methods) and to assess
additional aspects of performance. He enumerated and discussed potential benefits that may
be derived from employing computer-based systems to administer traditional tests.
Some of these are as follows: (a) individualizing assessment, (b) increasing the
flexibility and efficiency for managing test information, (c) enhancing the economic value
and manipulation of measurement databases, and (d) improving diagnostic testing.
Millman (1984) agreed with Ward, especially regarding the ideas that
computer-based measurement encourages individualizing assessment and designing
software within the context of cognitive science, and that what limits computer-based
assessment is not so much hardware inadequacy as incomplete comprehension of the
processes intrinsic to testing and knowing per se (Federico, 1980).

As is evident, the literature regarding computer-based assessment is contradictory
and inconclusive: Many benefits may be obtained from computerized testing. Some of
these may be related to attitudes and assumptions associated with the use of novel
media or innovative technology per se. However, and just as readily, potential
problems may result from the employment of computer-based measurement. Differences
between this mode of assessment and traditional testing techniques may, or may not,
impact upon the reliability and validity of measurement. Notably absent from this
literature are studies that have compared these testing characteristics of computer-based
assessment with customary measurement methods for assessing recognition
performance.



Problem 

Many student assessment procedures which are currently used in Navy training
are suspected of being insufficiently accurate or consistent. If true, this could result in
overtraining, which increases costs needlessly, or undertraining, which culminates in
unqualified graduates being sent to the fleet commands. Many of the customary
methods for measuring performance either on the job or in the classroom involve
instruments which are primarily paper-based in nature (e.g., check lists, rating scales,
critical incidents; and multiple-choice, completion, true-false, and matching formats).
A number of deficiencies exist with these traditional testing techniques such as (a)
biased items are generated by different individuals, (b) item writing procedures are
usually obscure, (c) there is a lack of objective standards for producing tests, (d) item
content is not typically sampled in a systematic manner, and (e) there is usually a poor
relationship between what is taught and test content.

What is required is a theoretically and empirically grounded technology of
producing procedures for testing which will correct these faults. One promising approach
employs computer technology. However, very few data are presently available regard- 
ing the psychometric properties of testing strategies using this technology. Data are 
needed concerning the accuracy, consistency, sensitivity, and fidelity of these 
computer-based assessment schemes compared to more traditional testing techniques. 



Objective 

The specific objective of this research was to compare the reliability and validity 
of a computer-based and a paper-based procedure for assessing recognition perfor- 
mance. 



METHOD 



Subjects 

The subjects were 83 male student pilots and radar intercept officers (RIOs) from 
the Fleet Replacement Squadron, VF-124, NAS Miramar, who volunteered to
participate in this study. This squadron trains crew members to fly the F-14 fighter as well as
make intercepts using its many complex systems. One of the major missions of the
F-14 is to protect carrier-based naval task forces against antiship, missile-launching,
threat bombers. This part of the F-14's mission is referred to as Maritime Air
Superiority (MAS), which is taught in the Advanced Fighter Air Superiority (ADFAS)
curriculum in the squadron. It is during ADFAS that students learn to recognize or
identify Soviet and non-Soviet aircraft silhouettes so that they can employ the F-14 
properly. 






Subject Matter 

The subject matter consisted of line drawings of front, side, and top silhouettes of
Soviet and non-Soviet aircraft. A paper-based study guide was designed and
developed for the subjects to help them learn to recognize silhouettes of four Soviet
naval air bombers and ten of their front-line fighters. Silhouettes of non-Soviet aircraft
were also presented since these could be mistaken for Soviet threats or vice versa. 

The silhouettes of Soviet and non-Soviet aircraft appeared on 28 pages of the
study guide. These were presented so that Soviet aircraft were displayed on a left
page, and corresponding non-Soviet aircraft on the immediately following right page.
A specific Soviet silhouette appeared on the left page either in the top, middle, or
bottom position. The non-Soviet silhouette appeared on the right page in the
corresponding top, middle, or bottom position. All top views of Soviet and non-Soviet aircraft
were presented first. These were followed by all side and front views, respectively.
Each Soviet top, side, and front view had its own corresponding non-Soviet aircraft.

Subjects were asked to study each Soviet silhouette and its corresponding
non-Soviet silhouette in sequence and note the distinctive features of each. The correct
identification of each Soviet and non-Soviet silhouette according to NATO name and
alphanumeric designator appeared directly below it. Subjects were told that in the near
future, their recognition of these Soviet and non-Soviet aircraft would be assessed via
computer and traditional testing.

In addition to using the paper-based study guide, subjects were required to learn 
the silhouettes via the computer system described below which was configured in a 
training mode for this purpose. In this mode, when a student pressed the <TAB> key,
a silhouette would reappear together with its correct identification so that these could 
be associated. 



Computer-Based Assessment 

Graphic models were produced to assess how well the subjects recognized or 
identified the above silhouettes. A computer game based upon a sequential recognition 
paradigm was designed and developed. It randomly selects and presents on a computer
display, at an arbitrary exposure setting, the front, side, or top views of four Russian
bombers and 10 of their advanced fighters. For this research, the exposure of a
silhouette on the computer screen was approximately 500 milliseconds. Also, the
game management system can choose and flash corresponding silhouettes of NATO 
aircraft which act as distractors because of their high degree of similarity to the Soviet 
silhouettes. 

This particular game, which is called FLASH IVAN (the F-14 community refers
to the Russians generically as "Ivan"), assesses student performance by measuring
their "hit rate," or number of correct recognitions out of a total of 42 silhouettes, half of
which are Soviet and the other half non-Soviet; the time, or latency, it takes a student to
make a recognition judgment for each target or distractor aircraft; and the degree of
confidence the student has in each of his recognition decisions. At the end of the
game, feedback is given to the student in terms of his hit rate (computer-based test
total percentage correct responses, CTP), average response latency (computer-based
test total average response latency, CTL), average degree of confidence in his
recognition judgments (computer-based test total average degree of confidence, CTC), and
how his performance compares to other students who have played the game.
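
The three feedback measures can be illustrated with a minimal Python sketch. It is
not the RECOG implementation, which this report does not reproduce. The Trial record,
its field names, and the use of seconds for latency are assumptions made only for
illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """One silhouette presentation (hypothetical record layout)."""
    correct: bool      # was the aircraft identified correctly?
    latency_s: float   # response latency; seconds are an assumed unit
    confidence: int    # 0 to 100, in ten-point steps


def summarize(trials):
    """Return (CTP, CTL, CTC) for one student's list of trials."""
    ctp = 100.0 * sum(t.correct for t in trials) / len(trials)  # percent correct
    ctl = mean(t.latency_s for t in trials)                     # average latency
    ctc = mean(t.confidence for t in trials)                    # average confidence
    return ctp, ctl, ctc
```

For a full game, the list passed to summarize would hold 42 trials, one per silhouette.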

A file is maintained and available to the instructors which provides, in addition to
these parameters for each student, recognition performance across aircraft for all
students who played the game. This provides diagnostic assessments to instructors who
can use this summative feedback to focus student attention on learning the salient dis- 
tinctive features of certain aircraft in order to improve their recognition performance. 

The game management system is programmed in a modular manner: instructing
the student on how to play the game, retrieving and displaying individual images,
keeping track of how well students perform, providing them feedback, and linking
these components in order to execute the game. This modularity in programming,
together with the game management system's independence of any specific graphic
database (e.g., ship silhouettes, human anatomy, electronic circuits, topography),
contributes to its wide applicability. The game, then, provides a set of software tools
which can be used by others who need to assess recognition performance. This
computer-based system for assessing recognition performance (RECOG) has been
completely documented by Little, Maffly, Miller, Setter, & Federico (1985).

Paper-Based Assessment 

Two alternative forms of a paper-based test were designed and developed to
assess the subjects' recognition of the silhouettes mentioned above. The alternative test
forms mimicked as much as possible the format used by FLASH IVAN. Both forms
of the test were presented as booklets each containing 42 items representing the front,
top, or side silhouettes of aircraft. The subjects' task was to identify as quickly as
possible the aircraft that was represented by each item's silhouette. They were asked to
write in the space provided what they recognized the aircraft to be (i.e., its NATO
name or corresponding alphanumeric designation; e.g., FOXHOUND or MIG-31).
Misspellings counted as wrong responses. Subjects were instructed not to turn back to
previous pages in the test booklet to complete items they had left blank. The students
were asked to go through the test items quickly, to approximate as much as possible
the duration of silhouette exposure employed by FLASH IVAN. Subjects were
monitored to assure they complied with this procedure.

After they wrote down what they thought an aircraft was, they were required to
indicate on a scale which appeared below each silhouette the degree of confidence or
sureness in their recognition decision concerning the specific item. Like the
confidence scale used for FLASH IVAN, this one went from LEAST CONFIDENT or
0% CONFIDENCE in their recognition decision on the left, to MOST CONFIDENT
or 100% CONFIDENCE on the right, in ten percentage point intervals. Subjects were
instructed to use this confidence scale by placing a check mark directly over the
percentage of confidence which best reflected or approximated the amount of sureness
they had in their judgment. To learn how to respond properly to the silhouette test
items, the subjects were asked to look at three completed examples. A subject's
percentage of correct recognitions (paper-based test total percentage correct responses,
PTP) and average degree of confidence (paper-based test total average degree of
confidence, PTC) for the paper-based test were measured and recorded.



Procedure 

Prior to testing, subjects learned to recognize the aircraft silhouettes using two
media: (a) in paper-based form structured as a study guide, and (b) in computer-based
form using FLASH IVAN in the training mode. Mode of assessment, computer-based
or paper-based, was manipulated as a within-subjects variable (Kirk, 1968). All
subjects were administered the paper-based test before the computer-based test. The two
forms of the paper-based tests were alternated in their administration to subjects (i.e.,
the first subject received Form A, the second subject received Form B, the third
subject received Form A, etc.). After subjects received the paper-based test, they were
immediately administered the computer-based test. It was assumed that a subject's
state of recognition knowledge was the same during the administration of both tests.
Subjects took approximately 10-15 minutes to complete the paper-based test, and
15-20 minutes to complete the computer-based test. This difference in completion time
was primarily due to lack of typing proficiency among some of the subjects.

Reliabilities for both modes of testing were estimated by deriving internal
consistency indices using an odd-even item split. These reliability estimates were adjusted
by employing the Spearman-Brown Prophecy Formula (Thorndike, 1982). Reliability
estimates were calculated for test score, average degree of confidence, and average
response latency for the computer-based test; reliability estimates were calculated for
test score and average degree of confidence only for the paper-based test. None was
computed for average response latency since this was not measured for the paper-based
test. Equivalences between these two modes of assessment were estimated by
Pearson product-moment correlations for total test score and average degree of
confidence.
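
The odd-even split with the Spearman-Brown adjustment can be sketched as follows.
This is a minimal illustration rather than the analysis code used in the study. It
assumes item-level scores are held in a subjects-by-items array in presentation order,
and the function names are hypothetical. For a test split into two halves, the
adjustment is r_full = 2 * r_half / (1 + r_half).

```python
import numpy as np


def spearman_brown(r_half, k=2.0):
    """Project the reliability of a test k times as long as the part correlated."""
    return k * r_half / (1.0 + (k - 1.0) * r_half)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability, adjusted to full test length.

    item_scores: 2-D array-like, rows = subjects, columns = items in order.
    """
    x = np.asarray(item_scores, dtype=float)
    odd_total = x[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even_total = x[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_total, even_total)[0, 1]  # Pearson r between halves
    return spearman_brown(r_half)
```

For example, a half-test correlation of .80 adjusts to 1.6 / 1.8, or about .89.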

In order to derive discriminative validity estimates, research subjects were placed
into two groups according to whether their performance through the squadron's
curriculum was above or below the mean average grade for this sample. A stepwise
multiple discriminant analysis, using Wilks' criterion for including and rejecting
variables, and their associated statistics were computed to ascertain how well
computer-based and paper-based measures distinguished among the defined groups expected to
differ in their recognition of aircraft silhouettes.

Predictive validity indices were obtained by computing a canonical analysis
between computer-based and paper-based recognition measures and subjects' test
scores for each phase of the curriculum: (a) Familiarization Phase (FAM): all aspects
of the F-14's systems, capabilities, limitations, and emergency procedures as well as
formation, instrument, night, and acrobatics flying; (b) Basic Weapons Employment
(BWP): the basics of the F-14's radar and weapon systems and rudimentary intercept
procedures; (c) Guns (Gun): the F-14's 20mm gun is taught and the trainees actually
fire it at a banner towed by another aircraft and at a simulated ground target; (d)
Advanced Fighter Air Superiority (ADF): advanced outer air battle tactics dealing
with electronic countermeasures emphasizing Soviet aircraft, weapons, and tactics as
well as U.S. battle group tactics; and (e) Tactics (TAC): tactically fighting the F-14
in several likely combat scenarios against other hostile aircraft.



RESULTS 



Reliability and Equivalence Estimates 

Tables of reliability and equivalence estimates are presented in the appendix.
Split-half reliability and equivalence estimates of computer-based and paper-based
measures of recognition performance are presented in Table A-1. It can be seen that
the adjusted reliability estimates are relatively high, ranging from .89 to .97. The
difference in reliabilities for computer-based and paper-based measures for average
degree of confidence was found to be statistically significant (p < .02) using a test
described by Edwards (1964). However, the difference in reliabilities for
computer-based and paper-based measures of the recognition test score was found to be not
significant. These results revealed that (a) the computer-based and paper-based
measures of test score were not significantly different in reliability or internal consistency,
and (b) the paper-based measure of average degree of confidence was more reliable or
internally consistent than the computer-based measure.

Equivalence estimates between corresponding computer-based and paper-based
measures of recognition test score and average degree of confidence were .67 and .81,
respectively. These suggested that the computer-based and paper-based measures had
anywhere from approximately 45% to 66% variance in common, implying that these
different modes of assessment were only partially equivalent. The equivalences for
test score and average degree of confidence measures were significantly (p < .001)
different. This result suggested that computer-based and paper-based measures of
average degree of confidence were more equivalent than these measures of recognition test
score.
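
The variance-in-common figures follow from squaring the equivalence correlations.
The short sketch below checks that arithmetic for the reported values of .67 and .81;
the function name is hypothetical.

```python
def shared_variance_pct(r):
    """Percentage of variance two measures share, given their correlation r."""
    return 100.0 * r * r


for label, r in [("test score", 0.67), ("average degree of confidence", 0.81)]:
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance_pct(r):.1f}%")
# test score: r = 0.67, shared variance = 44.9%
# average degree of confidence: r = 0.81, shared variance = 65.6%
```

Rounded to whole percentages, these are the 45% and 66% cited above.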



Discriminative Validity 

The multiple discriminant analysis (Cooley & Lohnes, 1962; Tatsuoka, 1971; Van
de Geer, 1971), which was computed to determine how well computer-based and
paper-based measures of recognition performance differentiated groups defined by
above or below mean average curriculum grade, yielded one significant discriminant
function as expected. The statistics associated with the significant function,
standardized discriminant-function coefficients, pooled within-groups correlations between the
function and computer-based and paper-based measures, and group centroids for above
or below mean average curriculum grade are presented in Table A-2. It can be seen
that the single significant discriminant function accounted for 100% of the variance
between the two groups. The discriminant-function coefficients, which consider the
interactions among the multivariate measures, revealed the relative contribution or
comparative importance of the variables in defining this derived dimension to be CTC,
PTC, PTP, CTP, and CTL. The within-groups correlations, which are computed for
each individual measure partialling out the interactive effects of all the other variables,
indicated that the major contributors to the significant discriminant function were CTP,
CTC, and CTL, respectively, all computer-based measures. The group centroids
showed that those students whose curricular performances were above the mean
average grade clustered together along one end of the derived dimension, while those
students whose curricular performances were below the mean average grade clustered
together along the other end of the continuum.

The means and standard deviations for groups above or below mean average
curriculum grade, univariate F-ratios, and levels of significance for computer-based and
paper-based measures of recognition performance are tabulated in Table A-3.
Considering the measures as univariate variables (i.e., independent of their multivariate
relationships with one another), these statistics revealed that one computer-based measure,
CTL, and one paper-based measure, PTC, significantly differentiated the two groups.
The means revealed that the group above mean average curriculum grade had shorter
computer-based latencies than the group below mean average curriculum grade, and
that the former group had a higher paper-based average degree of confidence than the
latter group. In general, the multivariate and subsequent univariate results established
that according to two sets of criteria, the discriminant coefficients and F-ratios and
corresponding means, the discriminative validities of computer-based and paper-based
measures were about the same for distinguishing groups above or below mean average
curriculum grade. However, according to another set of criteria, the pooled
within-groups correlations between the discriminant function and the computer-based and
paper-based measures, the former had discriminative validity superior to the latter.



Predictive Validity 

The statistics associated with the significant canonical correlation (Cooley &
Lohnes, 1962) between computer-based and paper-based measures of recognition
performance and curricular criteria are presented in Table A-4. These results
established that the computer-based and paper-based measures of recognition
performance were significantly associated with the curricular criteria. The canonical
variates revealed that the major contributors to this correlation, in order of
importance, were PTC, CTC, CTL, BWP, and ADF. When the relative magnitudes of the
canonical variates are considered, the paper-based measure, PTC, is the most salient
contributor to the correlation. However, 50 percent of the paper-based measures and
66 percent of the computer-based measures were the primary contributors to the
multivariate relationship between recognition performance and the basic weapons and
advanced fighter air superiority phases of the curriculum. The univariate
relationships among the above five major contributors to the canonical correlation,
as reflected by the Pearson product-moment correlations, revealed that CTL and PTC
were significantly associated with BWP and ADF. Nevertheless, the strengths of the
associations of CTL and PTC with BWP and ADF were found to be not significantly
different. All of these statistics associated with the canonical correlation
suggested that the predictive validity of computer-based measures approximates that
of paper-based measures.
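
The ordering of contributors given above follows from ranking the canonical variate weights in Table A-4 by absolute magnitude. The sketch below, offered only as an illustration, reproduces that ranking in Python. The weights are transcribed from Table A-4 as best they can be read (the entries printed as "-21" and "34" are taken to be -.21 and .34, which is an assumption), and the CTP weight is omitted because it is not legible in the table. The variable names are hypothetical.

```python
# Canonical variate weights as read from Table A-4.
# CTP is left out because its value is not legible in the source table.
canonical_variates = {
    "CTC": -1.06,
    "CTL": -0.73,
    "PTP": -0.21,  # printed as "-21"; decimal point assumed
    "PTC": 1.17,
    "FAM": -0.11,
    "BWP": 0.65,
    "GUN": 0.02,
    "TAC": -0.02,
    "ADF": 0.34,   # printed as "34"; decimal point assumed
}

# Rank variables by the absolute size of their weights, largest first.
ranked = sorted(
    canonical_variates,
    key=lambda name: abs(canonical_variates[name]),
    reverse=True,
)
top_five = ranked[:5]
print(top_five)
```

The first five names in the ranking are PTC, CTC, CTL, BWP, and ADF, matching the order of importance reported in the text.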



DISCUSSION 

This study established that the relative reliability of computer-based and
paper-based measures depends upon the specific criterion assessed. That is, regarding
the recognition test score itself, it was found that computer-based and paper-based
measures were not significantly different in reliability or internal consistency.
However, regarding the average degree of confidence in recognition judgments, it was
found that the paper-based measure was more reliable or internally consistent than its
computer-based counterpart. The extent of the equivalence between these two modes of
measurement was contingent upon particular performance criteria. It was demonstrated
that the equivalence of computer-based and paper-based measures of average degree of
confidence was greater than that for recognition test score. The relative
discriminative validity of computer-based and paper-based measures was dependent upon
the specific statistical criteria selected. The discriminant coefficients, F-ratios,
and corresponding means indicated that the validities of computer-based and
paper-based measures were about the same for distinguishing groups above or below mean
average curriculum grade. However, according to another set of criteria, the pooled
within-groups correlations between the discriminant function and computer-based and
paper-based measures, the former had validity superior to that of the latter. Also,
according to the statistics associated with the canonical correlation, this study
demonstrated that the predictive validity of computer-based measures approximates that
of paper-based measures. The results of this research supported the findings of some
studies, but not others. As will be discussed, the reported literature on this subject
is contradictory and inconclusive.

Federico and Liggett (1988, 1989) administered computer-based and paper-based
tests of threat-parameter knowledge (Liggett & Federico, 1986) in order to determine
the relative reliability and validity of these two modes of assessment. Estimates of
internal consistencies, equivalences, and discriminant validities were computed. They
established that computer-based and paper-based measures (i.e., test score and average
degree of confidence) were not significantly different in reliability or internal
consistency. This finding partially agrees with the corresponding result of the
present study, since computer-based and paper-based measures of test score were found
to be equally reliable; however, the computer-based measure of average degree of
confidence was found to be less reliable than its paper-based counterpart. A few of
the Federico and Liggett findings were ambivalent, since some results suggested that
equivalence estimates for computer-based and paper-based measures (i.e., test score
and average degree of confidence) were about the same, and another suggested that
these estimates were different. This reported result differs in part from that
established in the present study, where computer-based and paper-based measures of
test score were less equivalent than these measures of average degree of confidence.
Lastly, Federico and Liggett demonstrated that the discriminative validity of the
computer-based measures was superior to that of the paper-based measures. This result
is in partial agreement with that found in this reported research, where this was also
established with respect to some statistical criteria. However, according to other
criteria, the discriminative validities of computer-based and paper-based measures
were about the same.

Hofer and Green (1985) were concerned that computer-based assessment would
introduce irrelevant or extraneous factors that would likely degrade test performance.
These computer-correlated factors may alter the nature of the task to such a degree
that it would be difficult for a computer-based test and its paper-based counterpart
to measure the same construct or content. This could impact upon reliability,
validity, normative data, as well as other assessment attributes. Several plausible
reasons, they stated, may contribute to different performances on these distinct kinds
of testing: (a) state anxiety instigated when confronted by computer-based testing,
(b) lack of computer familiarity on the part of the testee, and (c) changes in
response format required by the two modes of assessment. These different dimensions
could result in tests that are nonequivalent; however, in this reported research these
diverse factors had no apparent impact.

On the other hand, there are a number of known differences between computer-based
and paper-based assessment which may affect equivalence and validity: (a) Passive
omitting of items is usually not permitted on computer-based tests. An individual
must respond, unlike on most paper-based tests. (b) Computerized tests typically do
not permit backtracking. The testee cannot easily review items, alter responses, or
delay attempting to answer questions. (c) The capacity of the computer screen can have
an impact on what usually are long test items, e.g., paragraph comprehension. These
may be shortened to accommodate the computer display, thus partially changing the
nature of the task. (d) The quality of computer graphics may affect the comprehension
and degree of difficulty of the item. (e) Pressing a key or using a mouse is probably
easier than marking an answer sheet. This may impact upon the validity of speeded
tests. (f) Since the computer typically displays items individually, traditional time
limits are no longer necessary (Green, 1986).

Sampson (1983) discussed some of the potential problems associated with
computer-based assessment: (a) not taking into account human factors principles to
design the human-computer interface, (b) individuals may become anxious to such a
degree when having to interact with a computer for assessment that the measurement
obtained may be questionable, (c) unauthorized access and invasion of privacy are just
some of the abuses that can result from computerized testing, (d) inaccurate test
interpretations by users of the system can easily culminate in erroneously drawn
conclusions, (e) differences in modes of administration may make paper-based norms
inappropriate for computer-based assessment, (f) lack of reporting reliability and
validity data for computerized tests, and (g) resistance toward using new
computer-based systems for performance assessment. A potential limitation of
computer-based assessment is depersonalization and decreased opportunity for
observation. This is especially true in clinical environments (Space, 1981). Most
computer-based tests do not allow individuals to omit or skip items, or to alter
earlier responses. This procedure could change the test-taking strategy of some
examinees. To permit it, however, would probably create confusion and hesitation
during the process of retracing through items as the testee uses clues from some to
minimize the degree of difficulty of others (Green, Bock, Humphreys, Linn, & Reckase,
1984).

Some of the comments made by Colvin and Clark (1984) concerning instructional
media can be easily extrapolated to assessment media. (Training and testing are
inextricably intertwined; it is difficult to do one well without the other.) This is
especially appropriate regarding some of the attitudes and assumptions permeating the
employment of, and enthusiasm for, media: (a) confronted with new media, computer-based
or otherwise, students will not only work harder, but also enjoy their training and
testing more; (b) matching training and testing content to mode of presentation is
important even though not all that prescriptive or empirically well established; (c)
the application of computer-based systems permits self-instruction and self-assessment
with their concomitant flexibility in scheduling and pacing training and testing; (d)
monetary and human resources can be invested in designing and developing
computer-based media for instruction and assessment that can be used repeatedly and
amortized over a longer time, rather than in labor-intensive classroom-based training
and testing; and (e) the stability and consistency of instruction and assessment can
be improved by media, computer-based or not, for distribution at different times and
locations, however remote.

When evaluating or comparing different media for instruction and assessment, the
newer medium may simply be perceived as being more interesting, engaging, and
challenging by the students. This novelty effect seems to disappear as rapidly as it
appears. However, in research studies conducted over a relatively short time span,
e.g., a few days or months at the most, this effect may still be lingering and
affecting the evaluation by enhancing the impact of the more novel medium (Colvin &
Clark, 1984). When matching media to distinct subject matters, course contents, or
core concepts, some research evidence (Jamison, Suppes, & Welles, 1974) indicates
that, other than in obvious cases, just about any medium will be effective for
different content.

Another salient question that should be addressed is: How can computer and
cognitive science, artificial intelligence (AI) technology, current psychometric
theory, and diagnostic testing be combined effectively and efficiently? It has been
demonstrated (Brown & Burton, 1978; Kieras, 1987; McArthur & Choppin, 1984; Wenger,
1987) that AI techniques can be developed to diagnose specific error-response patterns
or bugs to advance measurement methodology.

RECOG, together with the Soviet and non-Soviet aircraft silhouette database, is
referred to as FLASH IVAN. This system is currently being used to augment the
teaching and testing of this subject matter in VF-124. RECOG was designed and
developed with generalizability (i.e., independence of subject-matter domain) and
transferability (i.e., capable of readily running on different computer systems) in
mind, as was the Computer-Based Educational Software System (CBESS) (Brandt, 1987;
Brandt, Gay, Othmer & Halff, 1987). CBESS consists of a number of component quizzes
such as JEOPARDY, TWENTY QUESTIONS, and CONSTRAINT. Since the time that RECOG and
FLASH IVAN were developed and evaluated, two other tests, FLASH and PICTURE, were
added to CBESS. In terms of their function, these additional quizzes are similar to
RECOG and FLASH IVAN. However, these more recently produced tests are written in the
"C" programming language for the Navy standard microcomputer, the Zenith Z-248.
Consequently, since TERAK computers are no longer being produced (the company went out
of business), FLASH and PICTURE can be perceived as replacements for RECOG and FLASH
IVAN.

RECOMMENDATIONS 

Based upon the findings of this study, the following actions are recommended: 

(a) Commander, Naval Air Force, U.S. Pacific Fleet, use FLASH and PICTURE
to supplement the training and testing of fighter and other crew members to recognize
Soviet and non-Soviet silhouettes.

(b) Chief of Naval Operations, Total Force Training and Education, fund the
evaluation and seek implementation of FLASH and PICTURE in other content areas or
subject-matter domains (e.g., ship silhouettes, electronic schemata, human anatomy) to
ascertain the universality of the validity and reliability results established in this
reported research.
TABLES OF RELIABILITY AND VALIDITY ESTIMATES

Table A-1

Split-Half Reliability and Equivalence Estimates of Computer-Based
and Paper-Based Measures of Recognition Performance

                     Reliability
               -----------------------
               Computer-     Paper-
Measure        Based         Based         Equivalence

Score            .90          .89             .67
Confidence       .95          .97             .81
Latency          .93

Note. Split-half reliability estimates were adjusted by
employing the Spearman-Brown Prophecy Formula.
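
The Spearman-Brown adjustment named in the note to Table A-1 can be sketched in a few lines. The Python below is a minimal illustration of the standard formula, not the program used in this study. The function names are hypothetical, and the half-test correlation of .82 is an invented input chosen only to show the arithmetic.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Step a correlation between test parts up to the full-length test.

    With length_factor = 2 this is the usual split-half correction:
    r_full = 2 * r_half / (1 + r_half).
    """
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


def half_from_full(r_full):
    """Invert the split-half correction to recover the half-test correlation."""
    return r_full / (2.0 - r_full)


# Hypothetical half-test correlation, used only for illustration.
adjusted = spearman_brown(0.82)
print(round(adjusted, 3))

# Half-test correlation implied by an adjusted reliability of .90.
implied_half = half_from_full(0.90)
print(round(implied_half, 3))
```

Read in reverse, the formula implies that an adjusted reliability of .90, such as the computer-based score estimate in Table A-1, corresponds to a correlation of about .82 between the two test halves.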






Table A-2

Statistics Associated with Significant Discriminant Function, Standardized
Discriminant-Function Coefficients, Pooled Within-Groups Correlations Between
the Discriminant Function and Computer-Based and Paper-Based Measures,
and Group Centroids for Above or Below Mean Average Curriculum Grade

                          Discriminant Function

Eigen-     Percent      Canonical       Wilks      Chi
value      Variance     Correlation     Lambda     Squared     d.f.     p

 .14       100.00          .35           .88        9.98        5      .076

            Discriminant     Within-Group
Measure     Coefficient      Correlation      Group                       Centroid

CTP             .60             -.60          Above Mean Average Grade     -.32
CTC             .97              .55          Below Mean Average Grade      .42
CTL             .52             -.48
PTP            -.80             -.25
PTC            -.94              .03
Table A-3

Means and Standard Deviations for Groups Above or Below
Mean Average Grade, Univariate F-Ratios, and Levels of
Significance for Computer-Based and Paper-Based Measures

                              Group
                   Above Mean       Below Mean
Measure            Average Grade    Average Grade    F              p

CTP       X        [illegible]      [illegible]       .01           .92
          s          18.48            23.33
CTC       X          90.99            88.54          [illegible]    [illegible]
          s          12.74            13.06
CTL       X        1522.06          2115.61          [illegible]    [illegible]
          s        1554.12          1359.19
PTP       X        [illegible]      [illegible]      2.56           .11
          s          16.65            17.19
PTC       X        [illegible]      [illegible]      3.52           .05
          s        [illegible]      [illegible]
Table A-4

Statistics Associated with the Significant Canonical
Correlation Between Computer-Based and Paper-Based
Measures of Recognition Performance and Curricular Criteria

Canonical       Eigen-     Wilks      Chi
Correlation     value      Lambda     Squared     d.f.

[illegible]      .26        .55       45.13        25

Computer-Based
and Paper-and-      Canonical      Curricular     Canonical
Pencil Measures     Variate        Criteria       Variate

CTP                 [illegible]    FAM             -.11
CTC                 -1.06          BWP              .65
CTL                  -.73          GUN              .02
PTP                  -.21          TAC             -.02
PTC                  1.17          ADF              .34

          Pearson Product-Moment Correlations

               BWP             ADF

CTC            .08             .15
CTL            [illegible]    -.35a
PTC            .29a            .27a

Note. a. r(81) > .256, p < .01.